White Wines Exploration by Jeff Hartl

Guiding Question:

Which chemical properties influence the quality of wines?

Univariate Plots Section

## [1] 4898   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

There are 4898 observations in the white wines data set. The raw file has 13 variables, but after dropping the row identifier X the data frame shown above has 12. All the variables are either numeric or integers, and none are factors.

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Quality scores are discrete. The quality distribution is dominated by middling scores: 5’s and 6’s. Three quarters (3655 out of 4898) of the wines are quality 5 or 6. Mean quality is 5.878, median is 6, and 6 is the most common score. 3 was the lowest score, 9 the highest. No wines were rated 0, 1, 2, or 10. There were only five 9’s and only twenty 3’s. 183 wines were scored 4 or less, and another 180 were 8 or higher. Together that’s only about 7.4% of the total. There’s a sizable number of 7’s (880, about 18% of the total).
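The shares quoted above can be recomputed from the frequency table alone. Here is a minimal sketch in base R; the counts are copied from the table output, and the variable names (`quality_counts`, `share_mid`, and so on) are my own for illustration.

```r
# Counts per quality score, copied from the table output above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
total <- sum(quality_counts)

# Share of wines scored 5 or 6
share_mid <- sum(quality_counts[c("5", "6")]) / total

# Share of wines at the extremes (4 or less, 8 or more)
share_extreme <- sum(quality_counts[c("3", "4", "8", "9")]) / total

# Mean quality reconstructed from the counts
scores <- as.numeric(names(quality_counts))
mean_quality <- sum(scores * quality_counts) / total

round(c(mid = share_mid, extreme = share_extreme, mean = mean_quality), 3)
```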

The information file for the wines data set states that quality is the output variable, so it would make sense to treat it as a dependent variable. The file refers to quality as a “score”, so I’ll do so too. In the bivariate section I can look at quality across the eleven physicochemical attributes to see if any patterns stand out. I could also look at the average level of each attribute across the seven available quality scores.

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed acidity is the amount of tartaric acid; it occurs in grapes naturally.
It looks somewhat normally distributed, with very few outliers and not much variance away from the median of 6.8.

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile acidity, the amount of acetic acid, has a mean of 0.2782 g/L. Its distribution is skewed right in the first (top) histogram. The second histogram has a smaller bin width and includes a log10 transformation on the x-axis. I wonder if higher volatile acidity results in lower quality in the data set, since the wines information file says too much acetic acid can create an “unpleasant, vinegar taste”.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Most wines don’t stray far from the mean of citric acid, 0.3342 g/L. However, there are a few outliers.

Looking at a binwidth much smaller than my first citric acid plot, I see a noticeable count just under 0.5 g/L. I don’t know if this is an anomaly or if citric acid was intentionally added to some wines.

## 
##    0 0.01 0.02 0.03 0.04 0.05 
##   19    7    6    2   12    5
## zero_cit$citric.acid: 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   5.000   5.105   6.000   6.000

Since the minimum citric acid is 0, I wondered whether the absence of citric acid influenced quality. However, only 19 wines have 0 citric acid, and they are all scored 4, 5, or 6, averaging around 5, so I don’t see a strong connection between 0 citric acid and quality.

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

There’s a tall vertical area in the plot at a very low level of residual sugar (RS), near the 1st Quartile mark of 1.7 g/L. The plot is skewed right, though the counts between about 5 and 15 aren’t all that varied. I think that range of RS values pulls the mean higher despite the high count at that very low RS level. There’s also an outlier of 65.8, which is about 10 times the mean!

##  99% 
## 18.8

By limiting the x-axis beneath 18.8 g/L RS (the 99th percentile), I cut out the outliers and had a closer look at the majority of wines. The plot is still skewed right and there are still high counts at very low RS.
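Here is a rough sketch of the percentile cutoff idea in base R. It is not the code behind the plot: the data are a simulated right-skewed stand-in (an assumption, not the wine data), and it subsets the values rather than limiting the axis.

```r
set.seed(1)
# Hypothetical right-skewed stand-in for residual sugar, plus one extreme value
rs <- c(rexp(500, rate = 1 / 6), 65.8)

# 99th percentile cutoff
cutoff <- quantile(rs, probs = 0.99)

# Keep only the values at or below the cutoff
rs_trimmed <- rs[rs <= cutoff]

# Bin the trimmed values without drawing a plot
h <- hist(rs_trimmed, breaks = 30, plot = FALSE)
sum(h$counts)
```

Subsetting drops the outliers from the binning entirely, whereas limiting an axis only hides them from view; for a histogram of the bulk of the data the two usually look much the same.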

Because the residual sugar histogram was still skewed, I plotted it with a smaller bin width and added a log2 transformation on the x-axis. This looks almost bimodal (perhaps slightly trimodal, even?), with many wines roughly between about 1 and 3, and many others peaking at 8, roughly between 4 and 16; then there’s that outlier at 65.8 (near 64).

That’s why I chose a log base 2 transformation instead of base 10: I noticed peaks, troughs and outliers near 4, 8, 16, 32, and 64.

I did some reading about residual sugar at Wikipedia. I learned that the European Union set labelling standards for RS in wines: up to 4 g/L are “dry”, up to 12 g/L are “medium dry”, up to 45 g/L are “medium”, and above that are “sweet”. Other sweetness scales can be used as well, so the EU standard isn’t the last word. But it roughly coincides with the ups and downs of RS I’m seeing in this distribution.
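To make the EU thresholds concrete, here is a small sketch that bins residual sugar into the four labels with `cut()`. The ten values are the first ten shown in the `str()` output; the variable names are my own, and I'm assuming the "up to" bounds are inclusive.

```r
# First ten residual sugar values (g/L) from the str() output
rs <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Bins follow the EU thresholds quoted above; cut() makes upper bounds inclusive
sweetness <- cut(rs,
                 breaks = c(0, 4, 12, 45, Inf),
                 labels = c("dry", "medium dry", "medium", "sweet"))

table(sweetness)
```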

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

This is the distribution of chlorides (sodium chloride, NaCl, salt) in the wines, both raw and log10 transformed. The majority are near the median of 0.043 g/L. I wonder whether small changes in chlorides have an influence on quality.

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

## 
##     2     3     4     5     6     7     8     9    10    11  11.5    12 
##     1    10    11    25    32    25    35    29    55    45     1    51 
##    13    14    15  15.5    16    17    18    19  19.5    20    21    22 
##    55    68    79     1    58    89    80    84     1   101    93   102 
##    23  23.5    24    25    26    27    28  28.5    29    30  30.5    31 
##   110     1   118   111   129    99   112     1   160    99     1   132 
##    32    33    34    35  35.5    36    37    38  38.5    39  39.5    40 
##   109   112   128   129     2   127   111   102     1    89     1   103 
##  40.5    41  41.5    42  42.5    43  43.5    44  44.5    45    46    47 
##     1   104     2    86     1    63     1    75     4   101    64    91 
##    48  48.5    49    50  50.5    51  51.5    52  52.5    53    54    55 
##    66     7    82    64     2    54     1    72     4    68    61    58 
##    56    57    58    59  59.5    60  60.5    61  61.5    62    63    64 
##    42    44    37    39     2    38     2    47     1    29    30    23 
##  64.5    65    66    67    68    69    70  70.5    71    72    73  73.5 
##     1    14    17    22    24    17    11     1     5     6     8     4 
##    74    75    76    77  77.5    78    79  79.5    80    81    82  82.5 
##     5     7     5     5     1     4     2     4     1     7     2     1 
##    83    85    86    87    88    89    93    95    96    97    98   101 
##     4     2     2     4     1     1     1     1     3     1     3     2 
##   105   108   110   112 118.5 122.5   124   128   131 138.5 146.5   289 
##     2     3     1     1     1     1     1     1     1     1     1     1

Free sulfur dioxide (SO2): The median is 34 mg/L (aka parts per million, or ppm) and the mean is 35.31, but there are several outliers above 100 and one huge outlier at 289.

The first histogram plots the entire free.sulfur.dioxide variable. It looked fairly symmetric, but there was a very long tail to the right. In the second plot, I excluded the outliers above 100 ppm and reduced the binwidth. This reveals more symmetry, and also some discreteness (as does the table).

The wines information file states that “at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”. The third quartile is 46 ppm, which suggests winemakers know this and mostly keep free SO2 below 50. It will be interesting to see whether there’s a relationship with quality.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The distribution of total sulfur dioxide is roughly symmetric. There’s still a bit of a right skew, and I believe this shape reflects the inclusion of free SO2 in total SO2.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density doesn’t appear to vary much. The mean and median are almost identical. There are a couple of extreme outliers. Density depends on a wine’s alcohol and sugar by volume, according to the wines information file. I wonder if other attributes, such as chlorides, affect a wine’s density as well.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH distribution looks nearly normal. I tried various binwidths, but it still looked pretty symmetric. The mean is only .008 higher than the median of 3.180. The minimum value is 2.72 and the maximum is 3.82.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates are an additive that affects free and total sulfur dioxide. As with free SO2, this plot reveals regular, vertical bars, and like both free and total SO2 the plot is skewed right. Possibly, sulphates affect quality by affecting the other SO2 variables.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol content is expressed as a percentage of alcohol by volume, or “ABV”. In the white wines data set, the ABV ranges between 8% and 14.2%. Although the median is 10.4 and the mean 10.5, the mode appears to be lower than that.

My first alcohol histogram seemed multimodal and right-skewed. I tried transforming alcohol with both log base 10 and 2, but it didn’t change the visualization much. Maybe the distribution isn’t really as skewed as I thought. I wound up just setting a smaller binwidth and a lighter shade, instead of transforming. It makes a difference visually. Alcohol content appears to be an attribute that winemakers control intentionally, but I’m not sure.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in the data set, with 13 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality).

There are no categorical variables. The variables X and quality are integers. The other 11 “physicochemical” variables are numeric.

Each wine in the data set was given a simple identification number – the variable “X” – to anonymize it. I removed X from the data frame, as I saw no use for it in this project.

The “quality” variable is based on a sensory evaluation of the wine’s quality by at least 3 wine experts. Each wine was given a score between 0 and 10; the median of the experts’ scores is the quality score for a given wine. 74.6% of the wines are “normal” in quality (5 or 6). 6 was the most common score, comprising nearly half (45%) of the wines in the data set.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest. The wines information file calls quality an “output attribute”. I view quality as the dependent variable for the purpose of this exploratory analysis. I’d like to see how much the 11 physicochemical attributes influenced the wines’ quality scores.

Quality could be made categorical, and maybe one or more input variables could be turned into a categorical variable, if it will help the analysis.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point, I think volatile acidity, residual sugar, chlorides, free sulfur dioxide, and alcohol might influence the quality scores. This is based mostly on a reading of the wine data set’s informational file.

It’s possible that citric acid, pH, sulphates and density have an influence, too. Citric acid is added for flavor; sulphates affect free SO2; pH may affect sulphates; density depends on alcohol and sugar.

Hopefully the bivariate plots will narrow down which of these features I should focus on. Then I can reassess which features would support my investigation of the main feature, quality.

Did you create any new variables from existing variables in the dataset?

I didn’t create any new variables in this section. I will be creating new variables later on. For example, I’d like to divide up the quality variable into a new, factored variable made up of fewer qualitative categories, such as “poor”, “normal”, and “good”. This is because I think there are too few wines scored below 4 or above 7 in the data set, and grouping them together might make it easier to uncover relationships with the other variables.
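As a sketch of what that grouping might look like, the quality vector can be rebuilt from the frequency table and binned with `cut()`. The cut points (4 or less is “poor”, 5 to 6 is “normal”, 7 or more is “good”) are an assumption for illustration, and the names are my own.

```r
# Counts per quality score, copied from the frequency table
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality vector from the counts
quality <- rep(as.integer(names(quality_counts)),
               times = unname(quality_counts))

# Assumed cut points: (0,4] poor, (4,6] normal, (6,10] good
quality_group <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("poor", "normal", "good"),
                     ordered_result = TRUE)

table(quality_group)
```

With these cut points the groups hold 183, 3655, and 1060 wines, so the two outer groups are far larger than any single extreme score.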

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the residual sugar distribution because it was right-skewed. The transformed distribution emphasized that many wines clustered between 1 and 2 g/L, and many other wines clustered between 5 and 16. It made the distribution look bimodal, or even trimodal.

I also log-transformed the free sulfur dioxide distribution because of its long tail and extreme outliers. The transformed distribution did not provide more insight, though.

I also transformed the chlorides distribution because it looked like it had small variance coupled with a long thin right tail.

The citric acid distribution looked roughly normal with small variance, but after reducing the binwidth I noticed a sudden increase in the count at the 0.5 level, significantly above the median.

Bivariate Plots Section

I’m going to compare quality against my other features of interest. Then I’ll compare features other than quality with each other.

First, I’m examining correlation of quality with each of the physicochemical features. I’m running both Pearson’s and Spearman’s correlations, so I can refer to either as needed when looking at the plots that follow. I included Spearman’s because I learned that you can use it when you have ranked data (such as wine quality?).

Correlation Matrix of White Wine Quality with Other Attributes

## [1] "Pearson's r of each wine attribute vs quality"
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## [1,]        -0.114           -0.195      -0.009         -0.098     -0.21
##      free.sulfur.dioxide total.sulfur.dioxide density    pH sulphates
## [1,]               0.008               -0.175  -0.307 0.099     0.054
##      alcohol
## [1,]   0.436
## [1] "Spearman's rho of each wine attribute vs quality"
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## [1,]        -0.084           -0.197       0.018         -0.082    -0.314
##      free.sulfur.dioxide total.sulfur.dioxide density    pH sulphates
## [1,]               0.024               -0.197  -0.348 0.109     0.033
##      alcohol
## [1,]    0.44
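Here is a minimal sketch of how the two coefficients differ, using base R's `cor()`. The data are hypothetical toy values, not the wine data: Spearman's rho works on ranks, so a monotonic but curved relationship scores a perfect 1 while Pearson's r does not.

```r
# Toy data: y rises monotonically but not linearly with x
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Same idea applied column by column to a hypothetical mini data frame
toy <- data.frame(alcohol = c(9, 10, 11, 12, 13),
                  density = c(0.998, 0.996, 0.995, 0.992, 0.991),
                  quality = c(5, 5, 6, 7, 7))
rho <- sapply(toy[c("alcohol", "density")],
              function(col) cor(col, toy$quality, method = "spearman"))
round(rho, 3)
```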

For each physicochemical feature, I’d like to compare the feature’s mean for all wines to its mean value by each quality score. I’d also like to see how the means of each feature compare from quality score to quality score. To prevent the table from wrapping over the page, so it’s more concise, I’ll temporarily abbreviate the column names, and round all the means to 3 places.

##     fxa  vola   cit    rs  chlo   fSO2    tSO2  dens    pH  sul   alco
## 1 6.855 0.278 0.334 6.391 0.046 35.308 138.361 0.994 3.188 0.49 10.514
## Source: local data frame [7 x 12]
## 
##   quality   fxa  vola   cit    rs  chlo   fSO2    tSO2  dens    pH   sul
##     <int> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1       3 7.600 0.333 0.336 6.393 0.054 53.325 170.600 0.995 3.188 0.475
## 2       4 7.129 0.381 0.304 4.628 0.050 23.359 125.279 0.994 3.183 0.476
## 3       5 6.934 0.302 0.338 7.335 0.052 36.432 150.905 0.995 3.169 0.482
## 4       6 6.838 0.261 0.338 6.442 0.045 35.651 137.047 0.994 3.189 0.491
## 5       7 6.735 0.263 0.326 5.186 0.038 34.126 125.115 0.992 3.214 0.503
## 6       8 6.657 0.277 0.327 5.671 0.038 36.720 126.166 0.992 3.219 0.486
## 7       9 7.420 0.298 0.386 4.120 0.027 33.400 116.000 0.991 3.308 0.466
## Variables not shown: alco <dbl>.
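A table like this can be sketched in base R with `colMeans()` and `aggregate()`. The output above has the look of a dplyr summary, so this is a stand-in rather than the code I ran, and the mini data frame is made up for illustration.

```r
# Hypothetical mini data frame with two features and a quality score
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7),
  alcohol = c(9.0, 10.0, 10.0, 11.0, 12.0),
  pH      = c(3.1, 3.2, 3.2, 3.3, 3.4)
)

# Mean of each feature over all rows, rounded to 3 places
means_all <- round(colMeans(toy[c("alcohol", "pH")]), 3)

# Mean of each feature within each quality score
means_by_quality <- aggregate(cbind(alcohol, pH) ~ quality,
                              data = toy, FUN = mean)
means_by_quality
```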

It might be easier to get the gist of this if I plotted it.

This is a grid of boxplots of wine quality by each of the other features. Red dots indicate the mean of a feature at a given quality score. The blue lines gave me an idea of the overall relationship between quality and each feature. Later, I can add scatter plots to each boxplot in order to examine relationships in more detail.

I tried to pick the features whose plots appeared to show a relationship to quality. Based both on the correlation matrix and the mean-feature-by-quality plots, I will focus for now on the relationship of quality to these features:
* alcohol
* density
* chlorides
* volatile acidity
* pH
* sulfur dioxide

Quality and Alcohol

## [1] "Summary of alcohol for all wines"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20 
## [1] "                                     "
## [1] "Summary of alcohol by quality"
## d[["quality"]]: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## d[["quality"]]: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## d[["quality"]]: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## d[["quality"]]: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## d[["quality"]]: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## d[["quality"]]: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## d[["quality"]]: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
## Source: local data frame [7 x 2]
## 
##   quality mean_alcohol
##     <int>        <dbl>
## 1       3         10.3
## 2       4         10.2
## 3       5          9.8
## 4       6         10.6
## 5       7         11.4
## 6       8         11.6
## 7       9         12.2

This is a boxplot of alcohol content (ABV) per quality, laid over a jitter plot of alcohol vs. quality. Diamonds represent mean ABV at each quality score. The horizontal dashed line is at median alcohol for whole data set. A blue trend line is laid over the boxes.

The trend of the boxes goes downward till 5, then goes upward; this is also evident in the table. I detect a scarcity of high-ABV points among quality 3, 4, and 5 wines, and more plentiful high-ABV points among quality 7 & 8 wines. The boxes for quality 7, 8 and 9 wines are clearly higher than the rest, and their medians are at or above the 3rd-Quartile ABV of the lesser-quality wines, even 6’s.
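As a sanity check on the per-quality summaries, the group means weighted by the number of wines at each score should land close to the overall mean ABV of 10.51. This sketch copies the counts and means from the outputs above; the variable names are my own.

```r
# Wines per quality score (3 through 9), from the frequency table
n <- c(20, 163, 1457, 2198, 880, 175, 5)

# Mean alcohol per quality score, from the summaries by quality
mean_alcohol <- c(10.34, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)

# Count-weighted mean of the group means
overall <- weighted.mean(mean_alcohol, w = n)
round(overall, 2)
```

The small gap from 10.51 comes from the group means being rounded in the printed summaries.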

Spearman correlation of quality and alcohol = 0.44

I’m taking a look at summarizing variables and then plotting them. I can group the wines by an input variable (such as alcohol), then plot the mean quality at each unique value of the variable that has been grouped. The resulting scatter plot only shows a trace of horizontal rows of points, which does show an upward trend.

Here’s a scatter plot using the count of wines at each alcohol value to size the points. I added a smoother line to trace a relationship between alcohol content and mean quality. The dashed horizontal line lies at mean quality for all wines. In this plot, as ABV goes up, so does mean quality, generally.

Quality and Density

## [1] "Summary of density for all wines"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390 
## [1] "                                     "
## [1] "Summary of density by quality"
## d[["quality"]]: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## d[["quality"]]: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## d[["quality"]]: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## d[["quality"]]: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## d[["quality"]]: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## d[["quality"]]: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## d[["quality"]]: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

The quality 6 wines have a median and mean density identical to that of all wines, so the 6 box and jitter plot makes a nice visual separator of quality 3-5 and quality 7-9. The average densities of wines below quality 6 are slightly higher, and the average densities of wines above quality 6 are obviously lower. Because density has a few extreme outliers causing overplotting, the outliers have been omitted.

Spearman correlation of quality and density = -0.348

I was surprised that density had any relationship to quality, but there appears to be one.

Quality and Chlorides

## [1] "Summary of chlorides for all wines"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600 
## [1] "                                     "
## [1] "Summary of chlorides by quality"
## d[["quality"]]: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## d[["quality"]]: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## d[["quality"]]: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## d[["quality"]]: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## d[["quality"]]: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## d[["quality"]]: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## d[["quality"]]: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

There’s a difference in average chlorides when you compare quality 3-6 with 7 and above. Quality 3, 4, and 5 wines must have high chloride outliers: their means are near the top of the box. Median chlorides per quality score trend downward after quality score 5. Mean chlorides per quality score trend more sharply downward. As quality scores improve, the wines get less “salty” on average. There is a relationship to quality, but weaker than alcohol or density.

Spearman correlation of quality and chlorides = -0.314

The majority of wines are within a chloride range of about .025 to .06. This scatter plot drops as NaCl increases until NaCl nears 0.06, and then the plot flattens out into horizontal rows of points across the outliers.

Quality and Acidity

Volatile Acidity

## [1] "Summary of volatile.acidity for all wines"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000 
## [1] "                                     "
## [1] "Summary of volatile.acidity by quality"
## d[["quality"]]: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## d[["quality"]]: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## d[["quality"]]: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## d[["quality"]]: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## d[["quality"]]: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## d[["quality"]]: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## d[["quality"]]: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

The quality 4s and 5s have slightly higher median volatile acidity than the other scores. But the 3s, 4s and 5s all have higher mean volatile acidity than the better quality wines. Volatile acidity among the 3s and 4s varies more than among the others, too. Maybe poor quality can be explained by high volatile acidity, but it doesn’t look as if better than average quality can be explained by low volatile acidity.

Spearman correlation of quality and volatile acidity = -0.197

Most of the volatile acidity groups are close to the mean and median of quality for all wines, but I do see a mild trend below mean quality the higher the volatile acidity gets.

Quality and pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.215   3.320   3.820

Quality 4, 5, and 6 don’t stray far from average pH. Comparing the numbers in the summary statistics, I see that the quality 7’s, 8’s & 9’s are all higher in pH for each statistic. Curiously the quality 3 wines are also slightly above average pH. Then again, there are many more wines at quality 7-8-9 combined than at 3 alone. Could above-average pH make a small difference in quality?

Spearman correlation of quality and pH = 0.109

I don’t see much relationship between pH and quality here.

Quality and Sulfur Dioxide

For free SO2, there’s an anomalous dip at quality 4 again, but the variance appears to be small among all quality scores. The boxes don’t stray far from the overall median free SO2 at any quality other than 4. There is no apparent relationship to quality.

Spearman correlation of quality and free SO2 = 0.024

Total sulfur dioxide shows less variance among above-average quality wines. Good quality wines have less than average amounts of total SO2, but oddly enough so do quality 4 wines. Apart from quality 4, I do see a small negative trend.

Spearman correlation of quality and total SO2 = -0.197

Quality and total SO2 might be weakly related.

When grouped, free and total SO2 have similar-looking relationships to mean quality per SO2 level. However, I think this shows more the relationship of free SO2 to total SO2, rather than a relationship of either to quality directly.

Correlation of White Wine Attributes

Next, I created a correlation matrix for just the physicochemical features, excluding quality. In order to fit the matrix more succinctly on the page, I temporarily replaced the variables’ long names with abbreviations; I also rounded the correlation values to three places.

##        fixac  volac  citra  resid  chlor  frSO2  toSO2   dens     pH
## fixac  1.000 -0.023  0.289  0.089  0.023 -0.049  0.091  0.265 -0.426
## volac -0.023  1.000 -0.149  0.064  0.071 -0.097  0.089  0.027 -0.032
## citra  0.289 -0.149  1.000  0.094  0.114  0.094  0.121  0.150 -0.164
## resid  0.089  0.064  0.094  1.000  0.089  0.299  0.401  0.839 -0.194
## chlor  0.023  0.071  0.114  0.089  1.000  0.101  0.199  0.257 -0.090
## frSO2 -0.049 -0.097  0.094  0.299  0.101  1.000  0.616  0.294 -0.001
## toSO2  0.091  0.089  0.121  0.401  0.199  0.616  1.000  0.530  0.002
## dens   0.265  0.027  0.150  0.839  0.257  0.294  0.530  1.000 -0.094
## pH    -0.426 -0.032 -0.164 -0.194 -0.090 -0.001  0.002 -0.094  1.000
## sulph -0.017 -0.036  0.062 -0.027  0.017  0.059  0.135  0.074  0.156
## alco  -0.121  0.068 -0.076 -0.451 -0.360 -0.250 -0.449 -0.780  0.121
##        sulph   alco
## fixac -0.017 -0.121
## volac -0.036  0.068
## citra  0.062 -0.076
## resid -0.027 -0.451
## chlor  0.017 -0.360
## frSO2  0.059 -0.250
## toSO2  0.135 -0.449
## dens   0.074 -0.780
## pH     0.156  0.121
## sulph  1.000 -0.017
## alco  -0.017  1.000
##       fx v ct r ch fS t d p s a
## fixac 1                   .    
## volac    1                     
## citra      1                   
## resid         1       . +     .
## chlor           1             .
## frSO2              1  ,        
## toSO2         .    ,  1 .     .
## dens          +       . 1     ,
## pH    .                   1    
## sulph                       1  
## alco          . .     . ,     1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
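
A matrix and symbol summary of this kind can be sketched in base R roughly as follows. The data are synthetic and limited to four input columns, and the abbreviations are chosen to mirror the ones above; the exact calls used for this report are not shown in the text, so treat this as one plausible route.

```r
# Synthetic stand-in with a few of the input columns plus quality.
set.seed(1)
n <- 100
wines <- data.frame(
  fixed.acidity  = rnorm(n, 6.9, 0.8),
  residual.sugar = rexp(n, 1 / 6),
  density        = rnorm(n, 0.994, 0.003),
  alcohol        = rnorm(n, 10.5, 1.2),
  quality        = sample(3:9, n, replace = TRUE)
)

# Keep only the input features, dropping the quality output column.
inputs <- wines[, setdiff(names(wines), "quality")]

# Temporarily shorten the column names so the matrix fits the page.
names(inputs) <- c("fixac", "resid", "dens", "alco")

# Pearson correlations (the default of cor()), rounded to three places.
cor_matrix <- round(cor(inputs), 3)
print(cor_matrix)

# symnum() replaces each coefficient with a symbol for its magnitude bin;
# lower.triangular = FALSE prints the full symmetric matrix.
print(symnum(cor_matrix, lower.triangular = FALSE))
```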

Density has a strong correlation with both residual sugar (0.839) and alcohol (-0.780). Residual sugar, density, and alcohol all appear interrelated, and all three have moderate correlations with total SO2 (0.401, 0.530, and -0.449, respectively). Density and alcohol are both weakly correlated with chlorides (0.257 and -0.360).

Free SO2 and total SO2 are interrelated, of course (0.616), but sulphates correlate only weakly with either (0.059 and 0.135). Citric acid and total SO2 also correlate only weakly (0.121).

The most extreme outliers of residual sugar correspond to the most extreme outliers of density. The trend line for density vs RS shows the positive correlation, too, except at the lowest amounts of RS (where it’s hard to tell, because the points are overplotted).

This plot omits the outliers. RS has been transformed on the x-axis using log2, and the strong relationship between density and residual sugar now appears curvilinear.

This plot shows the negative relationship between alcohol and density.

Alcohol content is spread out among wines whose residual sugar is below 3 g/L or so. Above 4 g/L, there’s a discernible negative relationship. I didn’t transform RS in this plot because I wanted to emphasize the alcohol variance among low RS wines.

In this plot, RS is log2 transformed. It looks like a positive relationship below RS 4 g/L, and a negative one above that level, with one far outlier that bucks the trend.

Between alcohol and chlorides there’s a negative relationship, but it’s not very strong.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality was more strongly related to alcohol than to the other features of interest. With the exception of score 5, quality increases as alcohol content increases. When wines were grouped by alcohol content and plotted against the mean quality of each group, this relationship was even clearer.
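
The grouping step can be sketched in base R as below. The data are synthetic, and the 1% ABV bin width is an assumption made for illustration; the text does not say how the alcohol groups were formed.

```r
# Synthetic data: made-up scores that drift upward with alcohol,
# clamped to the 3-9 quality scale.
set.seed(7)
n <- 300
alcohol <- round(runif(n, 8, 14), 1)
quality <- pmin(9, pmax(3, round(3 + 0.4 * (alcohol - 8) + rnorm(n, 1, 1))))
wines <- data.frame(alcohol, quality)

# Assumed bin width of 1% ABV; include.lowest keeps wines at exactly 8%.
wines$alcohol.bin <- cut(wines$alcohol, breaks = 8:14, include.lowest = TRUE)

# Mean quality within each alcohol bin.
mean_quality <- aggregate(quality ~ alcohol.bin, data = wines, FUN = mean)
print(mean_quality)
```

Plotting `mean_quality$quality` against the bins then shows the trend without the overplotting of individual scores.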

Quality’s Spearman correlation with alcohol was +0.44. Quality with density was -0.35, and quality with chlorides was -0.31.

Quality and density relate negatively, a weak to moderate correlation.

Generally, lower chloride wines get higher quality scores on average. This too is a negative, weak-to-moderate correlation.

I was surprised that volatile acidity didn’t have a stronger influence on quality. I’d assumed it would because the wine information file said high volatile acidity gave a “vinegary” flavor to a wine. There is a weak relationship between high volatile acidity and low quality scores, but there’s a lot of variance within that. And there’s not much relationship at the other end: low volatile acidity didn’t translate to better quality.

Above-average pH levels were associated with quality 7, 8, & 9 wines, but the “worse” wines are more or less normal in pH. Above-average pH conceivably could predict a “good” quality wine, but average or below-average pH wouldn’t necessarily predict “normal” or “poor” quality wines.

Just like volatile acidity, sulfur dioxide failed to have as strong a relationship to quality as I’d assumed.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Density has a strong relationship with both residual sugar and alcohol. Density, alcohol, and residual sugar appear interrelated. Alcohol correlates negatively with density and residual sugar, while density and residual sugar are positively correlated. I was surprised that residual sugar showed up correlating with density and alcohol, given that there was little correlation between residual sugar and quality while density and alcohol do show correlation.

All three of these features have a moderate relationship with total sulfur dioxide. Density and alcohol are also somewhat related with chlorides.

Not surprisingly, free and total sulfur dioxide were correlated with each other; sulphates, however, correlated only weakly with either.

There was a surprising lack of correlation among the “acid” attributes (fixed, volatile, citric, and pH).

What was the strongest relationship you found?

Density is strongly correlated with both residual sugar (positively, r = 0.839) and alcohol (negatively, r = -0.780). Perhaps some combination of these variables could be used to predict the quality score of a wine.

The strongest relationship between the output variable (quality) and the input (physicochemical) variables was that of quality and alcohol.

Multivariate Plots Section

This scatter plot shows how white wine quality varies with both alcohol and density, two of the features most related with quality. I excluded a few density outliers. Quality 3 & 4 are grouped, as are 8 & 9, because there are so few 3’s and 9’s. 5 & 6 are separated in order to display more variety of quality.
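
One way to build such a grouped quality factor in base R is with `cut()`. This is a sketch on a tiny made-up vector; the name `quality.group` and the labels are hypothetical.

```r
# A few made-up quality scores covering the full 3-9 range.
quality <- c(3, 4, 5, 6, 7, 8, 9, 6, 5)

# Intervals are right-closed:
# (2,4] -> "3-4", (4,5] -> "5", (5,6] -> "6", (6,7] -> "7", (7,9] -> "8-9".
quality.group <- cut(
  quality,
  breaks = c(2, 4, 5, 6, 7, 9),
  labels = c("3-4", "5", "6", "7", "8-9"),
  ordered_result = TRUE
)

table(quality.group)
```

Keeping the result as an ordered factor means plotting functions can map it to a sequential or diverging color scale in rank order.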

The points are colored by the quality of the wine at that level of alcohol and density. Shades of blue indicate worse wines; shades of white and red indicate better wines. I also added trend lines of the alcohol-density relationship at each quality factor. The two grey intersecting lines represent the median values of alcohol and density; the lines divide the plot into four quadrants, which helped me visualize the concentrations of points at high and low alcohol and density amounts.

The wines tend to be of better quality in the top half of the plot, especially the top left quadrant. This is where wines with more alcohol and less density are plotted. The bottom right quadrant is different, with lower quality wines predominating at low-alcohol, high-density levels.

The distinction in color among the points runs in horizontal bands, with reds across the top of the plot and blues across the bottom: color changes more along the alcohol axis than along the density axis. It illustrates how alcohol has a stronger relationship to quality than density does.

The trend lines become steeper as quality improves. This emphasizes that the higher the ABV and the less the density, the more likely it was that a wine would be scored highly.

The plot also displays the strong negative relationship between alcohol and density, in the concentration of points in the top left and bottom right quadrants, along with the direction of the trend lines.

This plot suggests to me that alcohol and density might together predict the quality score of a wine more accurately than each would separately. Alcohol still has more influence on quality, but a high-ABV wine that was also low density would be a stronger predictor of high quality than high-ABV alone.

Do chlorides affect the relationship between alcohol and quality? After all, the correlation between chlorides and quality is not far below that of density and quality (and density is strongly related to alcohol).

For this plot I created a new variable, “NaCl.quartile”, that divided chlorides into four quartile groups. Chlorides do appear to play a small role in alcohol’s influence on quality at below-median chloride. The plot shows that alcohol is more directly related to high quality among the less salty wines. The trend lines show a linear fit between alcohol and quality when chlorides are below median.
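
A quartile variable like this can be sketched in base R with `quantile()` and `cut()`. The chloride values below are synthetic, and the labels Q1 to Q4 are hypothetical.

```r
# Synthetic, continuous chloride values (made up for illustration).
set.seed(3)
chlorides <- rlnorm(400, meanlog = log(0.045), sdlog = 0.3)

# Quartile boundaries: minimum, 25th, 50th, 75th percentiles, maximum.
breaks <- quantile(chlorides, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum value inside the first group.
NaCl.quartile <- cut(chlorides, breaks = breaks, include.lowest = TRUE,
                     labels = c("Q1", "Q2", "Q3", "Q4"))

table(NaCl.quartile)
```

With real measurements recorded to a few decimal places, ties at a boundary can make the groups slightly unequal in size.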

The saltier wines generally have less alcohol, but their points on the plot are concentrated at “normal” quality (5-6) scores. I suspect that high chlorides plus low ABV don’t account for poor quality as much as low chlorides plus high ABV account for good quality.

How about chlorides and density?… The scatter plot in each facet looks reversed compared to the previous plot. This makes sense, since chlorides’ relation is negative with alcohol and positive with density. The density relationship to quality is more regular in the first three quartiles of NaCl (not just the first two as in the alcohol-chlorides plot).

In this plot, I tried to see which feature, chlorides or density, appears more related to quality. I omitted the top and bottom 1% of both features, to zoom in on the majority of wines.
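
Trimming the top and bottom 1% of two features can be sketched in base R as follows. The data are synthetic and the helper `within_middle_98()` is a hypothetical name introduced for this example.

```r
# Synthetic chlorides and density values (made up for illustration).
set.seed(11)
n <- 1000
wines <- data.frame(
  chlorides = rlnorm(n, log(0.045), 0.4),
  density   = rnorm(n, 0.994, 0.003)
)

# TRUE for values between the 1st and 99th percentiles, inclusive.
within_middle_98 <- function(x) {
  limits <- quantile(x, probs = c(0.01, 0.99))
  x >= limits[[1]] & x <= limits[[2]]
}

# Keep a wine only if it sits in the middle 98% of BOTH features.
keep <- within_middle_98(wines$chlorides) & within_middle_98(wines$density)
trimmed <- wines[keep, ]

nrow(trimmed)
```

Since each feature is trimmed separately, between 2% and 4% of the rows are dropped, depending on how much the two sets of extremes overlap.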

The way I see this plot is that if there’s more separation in color from side-to-side, then chlorides are more related to quality; but if there’s more color separation up-and-down, then density is more related.

It’s almost too close to call, especially because chlorides have such a small variance in value. I think I see slightly more color separation up-and-down than side-to-side, so if I had to choose, I’d go with density.

I had assumed density was a function of chlorides, with saltier wines being denser. However, their correlation is weak, only 0.257. And the wine information file states that density is more a function of residual sugar.

Next, I explored the interrelation among density, residual sugar, and alcohol, which I’d noted in the bivariate section. It was odd that residual sugar wasn’t correlated with quality but the other two were.

This is a grid of scatter plots of density over residual sugar, faceted by every 1% of alcohol content. RS is log2 transformed. The intersecting dashed lines represent median RS and median density for all wines.

I like how this plot shows that:
* The relationship between RS and density is strong and positive (each scatter plot curves upward and has a curvilinear trend line), at every alcohol level.
* The relationship between density and alcohol is strong and negative (the scatter plots “shift” downward toward low density as we move to higher alcohol facets).
* The relationship between RS and alcohol is moderate and negative (the points become more concentrated in the “drier” left half of each scatter plot as we move to higher alcohol facets).

I wanted to explore alcohol and quality plotted with residual sugar. As quality goes up, the points become predominantly darker (more ABV): this is the expected positive relationship between alcohol and quality.

Generally, as RS increases, the shade of each “bar” of points becomes lighter, meaning there’s less alcohol. At better quality, the high-ABV points cluster more towards the middle of the x-axis. (If RS played no role in ABV’s influence on quality, I would’ve expected these points to stay on the left side of the x-axis at better quality scores).

I think this suggests that residual sugar has some influence on the relationship between alcohol and quality, despite the weak correlation between RS alone and quality.

Next, I explored all three input features (ABV, density, RS) together with quality output.

Here are density, RS, and ABV together. The relationship between density and residual sugar still looks positive and curvilinear. Alcohol’s negative relationship with the other two features is visible in the changing of colors in the points, with low ABV colors predominating at higher density and RS, and high ABV colors predominating toward lower density and RS.

This is the previous scatter plot as faceted by quality. The dashed blue lines mark median RS and density.

Quality scores get better as alcohol content increases and density decreases.

The influence of alcohol on quality shows up as the dominant color of each scatter plot changes from facet to facet. Low-ABV points predominate in poor and average quality facets, and high-ABV points predominate in the good and excellent facets. The quality 6 facet (median quality) shows a more equitable spread of colors.

The influence of density on quality shows up in how the points become more concentrated in the bottom half of each scatter plot (below median density) as we move from poorer to better quality facets.

Residual sugar doesn’t show much effect here. Otherwise, the points would’ve become more concentrated on the left half of each scatter plot toward better quality facets. There’s a hint of it at “good” quality, but that’s all.

So, I’ll try looking at this multivariate relationship in another way.

This is a grid of scatter plots of alcohol vs density, colored by quality and faceted by residual sugar. The intersecting green lines are median alcohol and median density. Trend lines for alcohol vs density are colored by each level of quality.

The predominance of good wine quality points in the upper left quadrant of each scatter plot highlights the relationship of alcohol and density to quality. Conversely, average and poor wine quality points predominate in the lower right quadrants (less ABV, more density).

I now see the effect that residual sugar has on the relationships of alcohol and density to quality. Dry wines (RS 1.2-4) at high ABV & low density have better quality scores. Medium-dry wines (RS 4-12) show a similar tendency, though not as strong. Conversely, in the RS <= 1.2 and RS > 12 facets, poorer wines predominate. This suggests the experts didn’t care as much for wines that were too dry or too sweet, especially the wines highest in residual sugar (RS > 12).

Of course, the residual sugar levels that I used for categorizing sweetness could be adjusted, or the RS variable could be divided into more than four categories. There’s more than one standard of what level of RS constitutes a dry, medium, or sweet wine. This is just one illustration, based on my earlier explorations.
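
The four residual sugar facets described above (RS <= 1.2, 1.2-4, 4-12, and > 12) can be sketched with `cut()` in base R. The name `RS.level` and the label text are hypothetical, and treating each upper boundary as inclusive is an assumption based on the "RS <= 1.2" and "RS > 12" wording.

```r
# A handful of made-up residual sugar values, including the boundaries.
residual.sugar <- c(0.6, 1.2, 1.6, 4, 5.2, 12, 20.7, 65.8)

# Right-closed intervals: (0,1.2], (1.2,4], (4,12], (12,Inf).
RS.level <- cut(
  residual.sugar,
  breaks = c(0, 1.2, 4, 12, Inf),
  labels = c("RS <= 1.2", "1.2 < RS <= 4", "4 < RS <= 12", "RS > 12")
)

data.frame(residual.sugar, RS.level)
```

Adjusting the sweetness categories is then just a matter of editing the `breaks` and `labels` vectors together.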

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol and density strengthened each other in terms of wine quality. Higher alcohol and lower density together showed a positive relationship to quality.

Chlorides strengthen the relationship of either alcohol or density to quality only among the least salty wines.

Residual sugar has some effect on quality when it’s seen in conjunction with alcohol and density. That was a surprise, given the extremely low correlation between RS and quality.

Were there any interesting or surprising interactions between features?

I was surprised that chlorides had less effect on density (and quality) than I’d supposed. The explanation is that density is more a function of residual sugar than of chlorides.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I did not create any models for the data set. However, I made some tentative suggestions about which features to choose for a model of the influence of physicochemical attributes on quality output. Those would be alcohol, density, residual sugar, and chlorides.


Final Plots and Summary

Plot One

Description One

The distribution of residual sugar (RS) is concentrated in two major ranges. This makes it appear broadly bimodal. There are many wines with an RS between 1 and 3 g/L, fewer between 3 and 4 g/L, and many others between 4 and 18 g/L.

There appear to be peaks, troughs, and outliers of RS near 1, 1.4, 2, 4, 8, 16, 32, and 64, all of which are approximately powers of 2 (1.4 ≈ 2^0.5). Perhaps there’s a sort of logarithmic reduction in RS as a wine ferments. It’s not precisely so, but the distribution could be described as somewhat tri-modal or even multimodal, roughly corresponding with some common ideas of what defines a wine as “dry”, “medium”, or “sweet”.

Plot Two

Description Two

The correlation between wine density and residual sugar is strong (Pearson’s r = 0.839); it was the strongest correlation I discovered between any two variables in the data set.

Residual sugar has been log2 transformed on the x-axis. A curvilinear regression line for density vs RS appears to be the best fit. For comparison, a linear model line has been added. In the bivariate section, my plotting of density and residual sugar led me to include residual sugar in my multivariate explorations.
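
The comparison between a straight and a curved fit on the log2 scale can be sketched numerically in base R. The data are synthetic, and the quadratic term is only a stand-in for "curvilinear"; the smoother used for the plot is not stated in the text.

```r
# Synthetic data: made-up density that rises linearly with RS,
# which therefore curves when plotted against log2(RS).
set.seed(5)
n <- 300
residual.sugar <- runif(n, 0.6, 20)
density <- 0.990 + 0.0004 * residual.sugar + rnorm(n, 0, 0.0005)
wines <- data.frame(residual.sugar, density)
wines$log2.RS <- log2(wines$residual.sugar)

# Straight line versus quadratic curve in log2(RS).
linear_fit    <- lm(density ~ log2.RS, data = wines)
quadratic_fit <- lm(density ~ poly(log2.RS, 2), data = wines)

# Share of variance explained by each fit.
c(linear    = summary(linear_fit)$r.squared,
  quadratic = summary(quadratic_fit)$r.squared)
```

If density is roughly linear in RS itself, the curve on the transformed axis is what one would expect, since log2 compresses the high-RS end of the scale.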

Plot Three

Description Three

This scatter plot shows how white wine quality varies with both alcohol and density.

The higher the alcohol by volume and the less the density, the more likely it was that a wine would be scored highly. Better quality wines predominate at high ABV and low density. Average and lower quality wines predominate at low ABV and high density.

Alcohol has a stronger relationship to quality than density does. The color separation in the plot runs in horizontal bands, with reds dominating the top half of the plot and blues dominating the bottom, so color changes mainly along the alcohol axis.

The plot’s slope illustrates the strong negative relationship between alcohol and density.

The plot suggests that alcohol and density together might predict the quality score of a wine more accurately than either would separately. A high-ABV wine that was also low density would be a stronger predictor of high quality than high-ABV alone.


Reflection

I explored the features of 4898 white wines in the “Portuguese Vinho Verde” data set, published in 2009. There were 13 variables. Eleven variables were “physicochemical” input attributes of wine. One variable contained anonymous identification numbers for the wines, which I removed. There was one output variable, “quality”, a sensory scoring of each wine by wine experts. Quality stood out as a variable to make categorical, since it was discrete and ranked.

I set out to discover what, if any, influence the eleven input attributes had on wine quality scores. Almost 75% of the wines were quality 5 or 6: so, not a large proportion of poor or excellent wines to explore. Later, I combined the 3 and 4 scores together into a single category, as well as the 8 and 9 scores.

The information file that came with the data set gave me a few leads about features that might affect quality. Some of these, such as volatile acidity, citric acid, and free sulfur dioxide, proved to my surprise to be uninstructive. I discovered features of interest that I hadn’t anticipated being related to quality: in particular, alcohol and density. In fact, I was surprised to find any relationship between density and quality. I thought that residual sugar (RS) would influence quality, but it appeared to have no direct correlation to quality. It was only after I discovered an interrelationship between alcohol, density, and the log base 2 of residual sugar, that I was able to see RS indirectly influencing wine quality (through its relationship to alcohol and density).

The feature with the greatest influence on quality in the wine data set was alcohol content. I came to believe that any model of wine attributes that could predict quality scores would have to include this feature. Density, residual sugar, and low chlorides might also be useful in predicting wine quality.

However, I didn’t discover any single feature or combination of features that would make a strong predictor of wine quality scores, at least not in this data set. Given a data set with more wines and a greater spread of quality scores, maybe stronger relationships would emerge. On a personal note, my limited experience and knowledge about wine might have led me down some dead-end tracks, and I may have missed some connections. I’d have to develop more subject matter knowledge before feeling confident in setting up any models. Nevertheless, this exploration was a good start.